Integrating Public and Private Data
With a Focus on Genome-Wide Association Studies


Hans-Martin Will, Ph.D.
Sr. Director, Head of Genomics R&D
Rosetta Biosoftware

02-27-2009
Rosetta Biosoftware

• Provider of commercial informatics data-management and analysis
  solutions with 10 years of commercial presence

• Enabling biopharmaceutical, academic and government organizations
  with solutions to drive research and innovative science forward

• Close collaborations with customers, Merck scientists, and FDA

Seattle, London and Tokyo
Core of this Presentation

Integrating Public Data Sets as a Valuable Resource into Internal R&D Efforts,
with a Focus on Genome-Wide Association Studies (GWAS)

Challenges and Lessons Learned from Integrating Many Structurally
Similar Data Sets
• For statistical geneticists, biologists, and genetics data
  producers/service providers
• Is a scalable repository to organize, analyze, and mine genomics study
  data
• Integrates data from the public domain with your proprietary data
  across technologies
• Is built on an open platform that integrates your analysis tools of
  choice while avoiding time spent on data formatting
• Is designed to work in conjunction with other Rosetta Biosoftware
  products to maintain your prior investments
Genome Browser + Google-style Search + Many Integrated Genomics Data Sources
Genome-Wide Association Studies
First Successes and Adoption Curve

• Race for the first major GWAS (type 2 diabetes) won by a group from the
  Genome Quebec centre at McGill University in collaboration with
  Imperial College London. Tested 400,000 markers; identified
  associations with 2 genes.

• 5 months later, the WTCCC carried out genome-wide association studies
  for 7 common diseases with a known significant familial component:
  coronary heart disease, type 1 diabetes, type 2 diabetes, rheumatoid
  arthritis, Crohn's disease, bipolar disorder and hypertension.

• The WTCCC study has become an “instant classic”; in less than 1 year it
  has been referenced by 346 publications [source: WTCCC].
Impact: WTCCC Citations Over Time

[Figure: new publications citing the WTCCC study, per quarter]
Over 16,000 significant and suggestive SNPs found in the analysis

Only 194 SNPs reported in the study publication




Currently Available GWAS Data


•   “In the past 2 years, there has been a dramatic increase in genomic
    discoveries involving complex, non-Mendelian diseases, with nearly
    100 loci for as many as 40 common diseases robustly identified and
    replicated in genome-wide association (GWA) studies” [T.A.
    Pearson, T.A. Manolio, JAMA 2008; 299(11):1335-1344]

•   Number of participants profiled to date in the public domain is
    approaching 100,000 (70,000 individuals in dbGaP alone)

•   Hundreds of millions of dollars are spent on GWAS in the public
    domain. Expenditures in the private domain (big pharma, hospitals,
    consumer healthcare companies) are likely at the same or higher level
GWA Studies Available from NCBI dbGaP Portal

Aging                            Coronary Disease                           Metacarpal Bones
Alzheimer Disease                Death                                      Motor Neuron Disease
Amyotrophic Lateral Sclerosis    Dementia                                   Muscles, Skeletal
Angina Pectoris                  Diabetes Mellitus                          Muscular Atrophy, Spinal
Asthma                           Diabetes Mellitus, Type 2                  Muscular Disease
Atherosclerosis                  Diabetic Nephropathy                       Myocardial Infarction
Atrial Fibrillation              Fatty Liver                                Neuroblastoma
Atrial Flutter                   Glaucoma                                   Obesity
ADD with Hyperactivity           Heart Diseases                             Osteoarthritis
Bipolar Disorder                 Heart Failure                              Osteoporosis
Bone Density                     Heart Failure, Congestive                  Parkinson Disease
Bones of Lower Extremity         Heart Valve Prosthesis                     Psoriasis
Brain Infarction                 Heart Valve Prosthesis Implantation        Psoriatic Arthritis
Brain Ischemia                   Hormone Replacement Therapies              Pulmonary Disease
Bulbar Palsy, Progressive        Hypertension                               Retinopathy
Cardiomyopathies                 Hypertrophy, Left Ventricular              Rhabdomyolysis
Cardiovascular Disease           Hysterectomy                               Risk Factors
Cardiovascular System            Intermittent Claudication                  Schizophrenia
Cataract                         Intracranial Aneurysm                      Sleep
Cerebrovascular Accident         Intracranial Arteriovenous Malformations   Sleep Apnea Syndromes
Cerebrovascular Disease          Ischemic Attack, Transient                 Sleep Apnea, Obstructive
Cerebrovascular Disorders        Lupus Erythematosus, Systemic              Smoking
Cholesterol                      Macular Degeneration                       Stroke
Congestive Heart Failure         Major Depressive Disorder
Coronary Artery Bypass           Menopause
Primary GWAS Publications

[Figure: total number of primary GWAS publications by year, 2005 to 2009 (y-axis 0 to 200), with the WTCCC publication marked. Data from NHGRI GWAS Catalogue (Jan 2009).]
How can Internal Research Benefit from these Data?




• Validation of internal findings

• Use result sets as gateway into research literature

• Extend biological context into other disease areas

• Enrich internal data sets for meta-analyses
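
To make the meta-analysis point concrete: when an internal study and a public study both report an effect estimate for the same SNP, the two can be pooled with an inverse-variance fixed-effect meta-analysis. The sketch below is illustrative only. The function name and the example effect sizes are hypothetical, and it assumes both studies report log odds ratios for the same allele on the same scale. It is not taken from any product or study named in this presentation.

```python
from math import sqrt
from statistics import NormalDist


def fixed_effect_meta(betas, standard_errors):
    """Pool per-study effect estimates with inverse-variance weights.

    betas: per-study effect estimates (e.g. log odds ratios), same allele.
    standard_errors: the matching standard errors.
    Returns (pooled_beta, pooled_se, two_sided_p_value).
    """
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    # Two-sided p-value from the normal approximation.
    z = pooled_beta / pooled_se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return pooled_beta, pooled_se, p_value


# Hypothetical numbers: an internal study and a public study of one SNP.
internal = (0.25, 0.12)   # (log odds ratio, standard error)
public = (0.18, 0.06)
beta, se, p = fixed_effect_meta([internal[0], public[0]],
                                [internal[1], public[1]])
print(f"pooled log OR = {beta:.3f}, SE = {se:.3f}, p = {p:.2e}")
```

A fixed-effect model assumes the studies estimate the same underlying effect; if the cohorts or phenotype definitions differ, a random-effects model may be the better choice.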




Example: Increase the Statistical Power of Association Studies

• Original report by Fung et al.: Lancet Neurology 2006; 911-916
    • Parkinson’s study contained 276 cases vs. 276 controls
    • Identified 26 SNPs with association p-value < 0.0001

• Expanded Study
    • Use Illumina iControl DB Study 67 and 64 data to increase power of
      the study
    • Powered study: 267 cases vs. 1,641 controls
    • Identified 114 SNPs with association p-value < 0.0001

[Figure: power* (0 to 100) as a function of the number of control individuals (270 to 1000)]

* Assumes disease allele frequency of 0.50. Calculated with CaTS software from: Skol AD, Scott LJ, Abecasis GR and Boehnke M, Nat. Genet. (2006) 38:209-13
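
The effect of adding controls can be sketched with a simple normal approximation for an allelic case-control test. This is not the CaTS calculation cited on the slide; it is a rough stand-in under stated assumptions. The control allele frequency of 0.5 and the threshold of 0.0001 follow the slide; the per-allele odds ratio of 1.5 and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def allelic_test_power(n_cases, n_controls, control_freq=0.5,
                       odds_ratio=1.5, alpha=1e-4):
    """Approximate power of a two-sided allelic test (normal approximation)."""
    # Allele frequency in cases implied by the assumed per-allele odds ratio.
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    case_freq = odds / (1.0 + odds)
    # Each individual contributes two alleles.
    case_alleles, control_alleles = 2 * n_cases, 2 * n_controls
    # Standard error of the frequency difference under the alternative.
    se_alt = sqrt(case_freq * (1 - case_freq) / case_alleles
                  + control_freq * (1 - control_freq) / control_alleles)
    # Standard error under the null, using the pooled allele frequency.
    pooled = ((case_freq * case_alleles + control_freq * control_alleles)
              / (case_alleles + control_alleles))
    se_null = sqrt(pooled * (1 - pooled)
                   * (1 / case_alleles + 1 / control_alleles))
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    delta = abs(case_freq - control_freq)
    # Probability that the test statistic lands in either rejection region.
    upper = 1 - normal.cdf((z_crit * se_null - delta) / se_alt)
    lower = normal.cdf((-z_crit * se_null - delta) / se_alt)
    return upper + lower


# Fixed number of cases, growing control set (values from the slide).
for controls in (276, 500, 1000, 1641):
    print(controls, round(allelic_test_power(267, controls), 3))
```

With the number of cases fixed, the gain from extra controls flattens out, which is consistent with the shape of a power curve plotted against control individuals.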
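Example: Increase the Statistical Power of GWAS

[Figure: Venn diagram comparing the published results, the results re-analyzed using PLINK, and the combined data analyzed using PLINK. Annotations: "Excluded from analysis", "Difference in methods", "Extension of data set". Counts shown in the diagram: 2, 110, 32, 1, 1, 16, 8. Data set legend: Cases; Controls (both data sets).]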
Integration Across the Data Pyramid

[Figure: pyramid spanning Public Data and Private Data, ordered by level of abstraction from top to bottom]

• Knowledge
• Integrative Results: Pathways, Networks
• Domain Specific Analysis Results: Associations, Correlations, Clusters
• Domain Specific Raw Assay Data: Clinical Measurements, Genotypes,
  Expression Profiles, Sequences, …
Challenges

• Privacy Issues
• Data Heterogeneity
• IT Support
Privacy Issues

Privacy Concerns
• Research participants are sensitive about personal information
    • Presence of disease risk
    • Paternity
    • Ancestry

Freedom of Research
• Potential for public benefit of genetic research is widely acknowledged

• General agreement that protecting privacy is critical
    • Genetic Information Nondiscrimination Act (GINA)
• Risks must be mitigated and balanced with the potential benefits
• Currently, this results in a lengthy approval process for access to data
Challenges: Data Heterogeneity

Plethora of data formats
• Lack of standards for certain domains
• Too many competing standards
  • MAGE-ML, SOFT, MiniML, MAGE-TAB, ISA-TAB, …
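
As a small illustration of what format heterogeneity costs in practice, the sketch below maps result records that use different column names onto one common schema. All column names, aliases and the example records are hypothetical; real formats such as those listed above need dedicated parsers, and this only shows the final renaming and type-normalization step.

```python
# Hypothetical aliases seen in different result files, mapped to one schema.
COLUMN_ALIASES = {
    "snp": "marker_id", "rsid": "marker_id", "marker": "marker_id",
    "p": "p_value", "pval": "p_value", "p-value": "p_value",
    "chr": "chromosome", "chrom": "chromosome",
}


def harmonize_record(record):
    """Rename known columns and normalize value types for one result row."""
    harmonized = {}
    for key, value in record.items():
        name = key.strip().lower()
        harmonized[COLUMN_ALIASES.get(name, name)] = value
    if "p_value" in harmonized:
        # Files often store p-values as text.
        harmonized["p_value"] = float(harmonized["p_value"])
    if "chromosome" in harmonized:
        # Accept both "4" and "chr4" style labels.
        chrom = str(harmonized["chromosome"])
        if chrom.lower().startswith("chr"):
            chrom = chrom[3:]
        harmonized["chromosome"] = chrom
    return harmonized


# Two made-up rows describing the same kind of finding in different layouts.
rows = [
    {"SNP": "rs0000001", "CHR": "chr4", "P": "1e-5"},
    {"rsid": "rs0000002", "chrom": 12, "pval": 0.00003},
]
for row in rows:
    print(harmonize_record(row))
```

Renaming columns is the easy part; agreeing on allele coding, genome build and phenotype vocabulary is where most of the effort tends to go.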



Taxonomies and vocabularies
• Overlapping scope (LOINC vs. MedDRA)
• Inconsistent and incomplete use of vocabularies
  (“comments”)




Statistical methodology
• What does a p-value express?
• How do findings translate across studies?
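
One reason a p-value is hard to compare across studies is the number of tests behind it. The arithmetic below is a sketch under simplifying assumptions: the tests are treated as independent and all null hypotheses as true. It uses the 400,000 markers and the 0.0001 threshold mentioned earlier in this presentation as illustrative inputs; the function names are hypothetical.

```python
def expected_false_positives(n_tests, threshold):
    """Expected number of null markers passing a per-test p-value threshold."""
    return n_tests * threshold


def bonferroni_threshold(n_tests, family_wise_alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return family_wise_alpha / n_tests


n_markers = 400_000
print("expected chance hits at p < 0.0001:",
      expected_false_positives(n_markers, 1e-4))
print("Bonferroni per-marker threshold:",
      bonferroni_threshold(n_markers))
```

Markers in linkage disequilibrium are correlated, so the Bonferroni bound is conservative; it is shown here only because it is the simplest correction to reason about.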



Data Deluge




STATISTICIAN REQUIRED!

Source: WTCCC Supporting Material
Challenges: IT Support

Data Transfer
• Policies and infrastructure are designed to prevent large-scale data access
• GSK cell line data: > 2 days transfer through corporate firewall

Data Storage
• Internal replication of all public data
• Resources already stretched by internal data generation

Data Processing
• Reprocessing pipeline requires extensive compute resources (cluster)
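
For planning purposes, transfer time can be estimated from data set size and sustained throughput. The sketch below is back-of-the-envelope arithmetic only: the data set size and throughput values are hypothetical (the presentation gives no sizes), decimal units are assumed, and protocol overhead is ignored.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_gigabytes, throughput_megabits_per_second):
    """Days needed to move a data set at a sustained effective throughput."""
    size_megabits = size_gigabytes * 8 * 1000  # decimal units: 1 GB = 8,000 Mb
    return size_megabits / throughput_megabits_per_second / SECONDS_PER_DAY


def implied_throughput_mbps(size_gigabytes, days):
    """Effective throughput implied by an observed transfer duration."""
    return size_gigabytes * 8 * 1000 / (days * SECONDS_PER_DAY)


# Hypothetical example: a 500 GB public data set over a 20 Mbit/s link.
print(round(transfer_days(500, 20), 2), "days")
# Hypothetical example: 500 GB that took 2 days implies this throughput.
print(round(implied_throughput_mbps(500, 2), 1), "Mbit/s")
```

Working backwards from an observed duration is often the quickest way to show that policy and firewall inspection, not raw link speed, set the effective rate.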
Summary
Preliminary


• Publicly available data are an underutilized resource of great value

• Heterogeneity of formats, annotations and methodology constitutes a
  substantial hurdle to integrating these data into research

• Organizations seeking to create in-house compendia based on
  publicly available data need to be prepared for significant investments
  in staff and infrastructure




Preview: FDA SNPTrack

• What is it?
    • Public repository and publicly available client for deposit of and
      access to GWAS data
    • Infrastructure for submission and review of (voluntary) genomic data
      submissions

• What are the objectives?
    • Open collaboration around best practices (complements MAQC)
    • Platform for exchange of and access to data of interest
    • Enablement of large-scale meta-analysis

• More details will be provided by Weida Tong later today

[Figure: collaboration around SNPTrack]

♯ Currently operating under LOI
Public FDA SNPTrack Portal
As Collaborative Effort across the Industry

Public SNPTrack Portal
• GWAS Data Sets
• GWAS Result Sets
• GWAS Methods

Common Data Formats and Quality Standards
Common Data Analysis Methodology

• Academic Researcher
• BioPharma Researcher
• FDA Reviewer
Summary & Outlook


• Publicly available data are an underutilized resource of great value

• Heterogeneity of formats, annotations and methodology constitutes a
  substantial hurdle to integrating these data into research

• Efforts such as the development of the FDA SNPTrack System and
  the MAQC facilitate collaboration and discussions driving towards data
  harmonization

• Industry-wide effort is needed to effectively solve these issues




Acknowledgements

• FDA
   • Weida Tong
   • Hong Fang
   • Joshua Xu
   • Steve Harris

• Merck Research Labs
   – Jason Johnson
   – Andrey Loboda

• Rosetta Biosoftware
   – Asa Oudes
   – Carol Preisig
   – Kristen Stoops
   – Michael Rosenberg
   – Yelena Shevelenko

CHI MMTC Integrating Public and Private Data


Editor's Notes

  1. Aggregate Results