Paper presentation @DILS'07

Accelerating Disease Gene Identification Through Integrated SNP Data Analysis

  • QTL mapping is a powerful tool for determining the role of host genetics in disease phenotypes. In theory, the genes involved in any measurable trait can be determined, e.g. weight or height, as well as disease.
  • When studied in detail, this QTL turned out to be two separate peaks associated with different disease phenotypes; however, each region was very large. The number of genes in this region is XX, and sequencing the whole area would take a long time and a lot of resources. Is this necessary given the information that is publicly available? By comparing parental strain SNPs, is it possible to pinpoint the areas of greatest difference between the parental strains in order to prioritise the candidate gene search? After all, although 1 SNP in a gene can make a significant difference, 50 SNPs will make more!
  • Inbred strains are genetically similar: one of the arguments for this being a good strategy is previous work suggesting that inbred strains are not as genetically diverse as previously thought. In fact, the arrangement of SNPs across the mouse genome falls into blocks that are common among strains. These blocks define haplotypes (patterns) across the genome, and areas of high or low diversity. This can be demonstrated by the QTL area of chromosome 12 just shown (click for red box). The susceptible mouse (top row, purple box) is clearly different from the two resistant strains (A/J and BALBc, yellow blocks). This is useful because if an offspring inherits a block of susceptible DNA that is 80% similar to the resistant strain, then the only point of interest will be the 20% that is different (click for blue block as example).
  • A SNP in which the allele for the selected strain differs from the allele observed in all the other strains supports the hypothesis that the SNP plays a role in the phenotype associated with the selected strain; such a SNP should therefore receive a high score.
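The note above on scoring a SNP for a single selected strain can be illustrated with a small sketch. This is not the scoring formula used in the presentation, which the notes do not give; it assumes, purely for illustration, that the score is the fraction of other strains (with a known allele) whose allele differs from the selected strain's. The function name `single_strain_score` and the toy alleles are hypothetical.

```python
def single_strain_score(alleles, selected):
    """Score one SNP for a selected strain (illustrative assumption only).

    alleles: dict mapping strain name -> base ("A", "C", "G", "T") or None if missing.
    Returns the fraction of other strains with a known allele that differs
    from the selected strain's allele; 0.0 if nothing can be compared.
    """
    target = alleles.get(selected)
    if target is None:
        return 0.0  # no allele for the selected strain: nothing to compare
    others = [a for strain, a in alleles.items()
              if strain != selected and a is not None]
    if not others:
        return 0.0  # no other strain has a known allele
    differing = sum(1 for a in others if a != target)
    return differing / len(others)


# Toy data: the selected strain differs from every other strain.
snp = {"C57": "A", "AJ": "G", "BALBc": "G"}
print(single_strain_score(snp, "C57"))  # 1.0
```

Under this assumption, a SNP whose selected-strain allele differs from all the others gets the maximum score of 1.0, which matches the intent described in the note.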

    1. Accelerating Disease Gene Identification Through Integrated SNP Data Analysis
       Paolo Missier, S. Embury, C. Hedeler, M. Greenwood (School of Computer Science, University of Manchester, UK)
       J. Pennock, A. Brass (School of Biological Sciences, University of Manchester, UK)
       DILS '07, Philadelphia, USA
    2. Overall goal
       Build a flexible data infrastructure to support current biology research involving gene polymorphism (SNP)
         • Add value to existing public SNP databases
         • Support multiple experimental added-value SNP analysis packages
       • Core application: improving the search for candidate gene selection in quantitative trait analysis
         • Analysis of genetic factors in observed quantitative phenotypes
           • resistance / susceptibility to a certain disease
           • life span, weight, …
    3. Example: study on a parasite worm
       • Trichuris trichiura
       • Trichuris muris
       • Same life cycle
       • Natural parasite of mice
       "…28% of the variation in Trichuris trichiura loads was attributable to genetic factors in both populations."
       S. Williams-Blangero et al. Genetic Component to Susceptibility to Trichuris trichiura: Evidence from Two Asian Populations. Genetic Epidemiol. 2002, 22(5):254
    4. Finding candidate genes
       • Candidate gene determination is an area of active research
         • (A. Chakravarti. Population genetics – making sense out of sequence. Nature Genetics, 21(Suppl. 1), January 1999)
       • Current methodology involves QTL mapping
         • Experimental method to correlate quantitative phenotype with genotype
         • Associates a region on the chromosome to a specific phenotype through complex in-breeding schemes
       Figure labels: Mixed responders, Resistant, Susceptible
    5. The challenge
       • A QTL may contain hundreds or thousands of genes
       • Quantitative phenotypes are often polygenic
       • Determination of candidate genes is a difficult and slow process
       Figure: Example QTL (chr 12)
       Automation is needed to narrow the scope of the search to a manageable size
    6. SNPs and their role in QT analysis
       • SNP: Single Nucleotide Polymorphism
         • single-base change in a strain relative to a reference strain (Mus musculus)
       • Strategy
         • Identify areas of greatest difference between resistant / susceptible strains
         • Prioritize candidate gene search using the density of highly differentiated gene regions
       Figure label: Priority region
    7. SNP informativeness
       • Rank SNPs according to strain differences
       • Strain allele: nucleotide base replacement for a SNP observed in a single strain
       Figure labels: Strain group 1 (resistant), Strain group 2 (susceptible)
       • Group score model:
         • Compare susceptible strains vs resistant strains
       • Perfect score:
         • Disjoint sets of alleles
         • No missing alleles
    8. Group strain score model
       • Strains and their corresponding alleles
       • For each SNP:
         • Common, distinct non-null alleles
         • Distinct non-null alleles in A1, A2
       • Penalties
    9. Example
    10. Score model performance
       • No standard test dataset
       • Criteria: evaluate ranking of polymorphic genes
         • Based on known candidate genes for HDL (cholesterol) QTL regions
       • From SNP scores to gene scores:
         • High-score SNP density / total SNP density
    11. Score selectivity
       • HDL data on Perlegen
       • { CAST/EiJ, C57BL/6J } vs { C3H/HeJ, FVB/NJ }
       7,090 / 101,896 ≈ 7.0%
       Translates to < 20 candidate genes
    12. The SNPit project
       • A "lightweight" SNP database designed to support genetic research
         • gene identification in QTLs is one application
         • Hope to answer a broader array of research questions beyond QTL analysis
       • SNPit is a secondary DB
         • Primary sources: Ensembl (EBI), dbSNP (NCBI), Perlegen
       • Others available, not considered in this study:
         • MGD (see Nucl. Acids Res., 35(Database issue), 2007)
         • UCSC (see Nucl. Acids Res., 35(Database issue), 2007)
         • Wellcome-CTC Mouse Strain SNP Genotype Set
         • MPD (Mouse Phenome Database)
    13. SNPit application challenges
       • Supports interactive exploratory analysis over large regions
       • Over 50 Mb, 200K SNPs/source/session
       • Typical flow:
         • Region selection (or gene set)
         • Source selection (multiple)
         • Strain group selection (per-session basis)
         • Compute score for each SNP in the region (on the fly)
         • (filter by gene polymorphism)
         • Rank SNPs by score and gene polymorphism (in-memory sorting)
         • Plot density of high-score SNPs over the selected region
         • Change parameters and repeat…
       Response times are typically within 30 seconds on a Tomcat deployment on a high-end server with a co-located DBMS (MySQL)
    14. Why multiple SNP DBs
       • SNP databases differ
         • Partially overlap in structure and content
         • Different update policy and frequency
       • Biologists like to choose their sources
         • Based on experience, prior usage, confidence
       • The SNPit application offers an explicit choice
       • It exploits complementary features and content of the DBs
    15. Data architecture
       Diagram: Ensembl SNP, dbSNP and Perlegen loaded into the SNPit DB (periodic updates); interdependent materialized views (Perlegen, dbSNP, Ensembl) keyed by rsId and ssId; core data processing (Score 1, Score 2, …); SNPit Web app and SNPit Web Service
       • no single global schema
       • queries against the views
       • some queries can be directed to more than one DB
       • End-user web app
       • Web Service accessible as a workflow processor (Taverna)
    16. SNPit access from a workflow
    17. SNP DB dependencies
       Diagram (all figures relative to chromosome 12): primary sources (Sanger Institute; public submission from multiple sources) send updates to Ensembl Mouse (407,000), NCBI dbSNP and Perlegen; each is loaded into SNPit (SNPs, Strain alleles, SNP Provenance; multiple SNPs, strains) and joined on rsId and ssId
       Counts shown on the diagram: Tot 407,000; Tot 420,000; 147,000; 146,000; 14,000; 133,000; 132,000; (420,000)
    18. Qualitative differences (table columns: Strain info, Strengths, Weaknesses)
       • Perlegen: 16 strains (ref + 15), fairly complete; no SNP location; not evolving; good quality control; high reputation
       • dbSNP: multiple sources; low quality control on public submission; timely; strain info not used; submitter info; update history (provenance)
       • Ensembl: about 60 strains, not very complete; curated SNPs; evolving; SNP location info (exonic, intronic); multiple reputable sources; controlled submission
       • Low timeliness (listed under Weaknesses)
    19. Missing strains (chr 17)
    20. Effect of source selection
       Figure labels: Ensembl, Perlegen
    21. Summary
       • SNPit complements current methodologies for candidate gene discovery in QTL regions
         • Helps focus on promising genes
         • Automates SNP analysis over large regions
       • View-based, loose integration of three prominent DBs
       • Original score models
         • More study needed to exploit other features
           • SNP location, submitter info, revision frequency…
       • Can be invoked from workflows
         • As part of larger in silico experiments
       • Plan to release SNPit as a public Web Service
    23. SNPs and their role in QT analysis
       • SNP: Single Nucleotide Polymorphism
         • single-base change in a strain relative to a reference strain (Mus musculus)
       • Inbred strains are genetically similar
       • The arrangement of SNPs across the mouse genome falls into blocks which are common among strains (haplotypes)
       • e.g. C57 strain (susceptible) different from A/J and BALBc strains (resistant)
    24. DB overlaps (Chromosome 17)
       Venn diagram: Perlegen 291,718; Ensembl 253,862; dbSNP
       Region counts shown on the diagram: 50,564; 122,938; 105,265; 243,702
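Overlap counts like those on the "DB overlaps" slide are what set operations over SNP identifiers produce. The following is a minimal sketch, assuming each source has already been reduced to a set of identifiers (for example rsIds). The function name `source_overlaps` and the toy identifiers are hypothetical, and the numbers are not those of the slide.

```python
from itertools import combinations


def source_overlaps(sources):
    """Count SNPs per source, per pair of sources, and shared by all sources.

    sources: dict mapping a source name -> set of SNP identifiers.
    Returns (sizes, pairwise, shared_by_all).
    """
    sizes = {name: len(ids) for name, ids in sources.items()}
    # Pairwise intersections, with source names in sorted order for stable keys.
    pairwise = {
        (a, b): len(sources[a] & sources[b])
        for a, b in combinations(sorted(sources), 2)
    }
    shared_by_all = len(set.intersection(*sources.values())) if sources else 0
    return sizes, pairwise, shared_by_all


# Toy identifiers only; real inputs would come from the three databases.
toy = {
    "Ensembl": {"rs1", "rs2", "rs3"},
    "dbSNP": {"rs2", "rs3", "rs4"},
    "Perlegen": {"rs3", "rs5"},
}
sizes, pairwise, shared = source_overlaps(toy)
print(sizes["Ensembl"], pairwise[("Ensembl", "dbSNP")], shared)  # 3 2 1
```

Using plain sets keeps the comparison independent of how each database stores its records, provided the identifiers have been normalised to a common form first.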

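The group score model and the step from SNP scores to gene scores (slides 7, 8 and 10) can be sketched as follows. The slide formulas did not survive in this transcript, so the code below is an assumption and not the authors' model. It only honours the two stated properties of a perfect score (disjoint allele sets, no missing alleles), and it treats "high-score SNP density / total SNP density" as the fraction of a gene's SNPs that score above a threshold, which holds if both densities are measured over the same gene span. The names `group_score` and `gene_scores`, the multiplicative missing-allele penalty and the 0.9 threshold are all hypothetical.

```python
def group_score(group1, group2):
    """Score one SNP by how well its alleles separate two strain groups.

    group1, group2: lists of alleles (a base such as "A", or None if missing),
    one entry per strain. Returns a value in [0, 1]; 1.0 requires disjoint
    allele sets and no missing alleles.
    """
    a1 = {a for a in group1 if a is not None}
    a2 = {a for a in group2 if a is not None}
    if not a1 or not a2:
        return 0.0  # one group has no known allele: no evidence either way
    # Separation: 1 when the two allele sets are disjoint, 0 when identical.
    separation = 1.0 - len(a1 & a2) / len(a1 | a2)
    # Assumed penalty: scale by the fraction of strains with a known allele.
    all_alleles = list(group1) + list(group2)
    known_fraction = sum(a is not None for a in all_alleles) / len(all_alleles)
    return separation * known_fraction


def gene_scores(snps, threshold=0.9):
    """Rank genes by the fraction of their SNPs scoring at or above threshold.

    snps: iterable of (gene, score) pairs. Returns [(gene, fraction), ...],
    highest fraction first.
    """
    totals, high = {}, {}
    for gene, score in snps:
        totals[gene] = totals.get(gene, 0) + 1
        if score >= threshold:
            high[gene] = high.get(gene, 0) + 1
    ranked = sorted(((high.get(g, 0) / n, g) for g, n in totals.items()),
                    reverse=True)
    return [(g, frac) for frac, g in ranked]


print(group_score(["A", "A"], ["G", "G"]))   # 1.0 (disjoint, nothing missing)
print(group_score(["A", None], ["G", "G"]))  # 0.75 (one of four alleles missing)
```

Ranking genes by the resulting fraction mirrors the intent of narrowing a large QTL region to a short list of candidate genes, but the threshold and the penalty would need to be tuned against known candidates.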