Published in: Technology

  1. interPopula: Database and tool integration for population genetics, with a focus on the HapMap project
       Tiago Rodrigues Antão, Liverpool School of Tropical Medicine, UK
  2. Preamble – the HapMap project (and UCSC Known Genes)
  3. HapMap
       The goal of the International HapMap Project is to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation. The HapMap is expected to be a key resource for researchers to use to find genes affecting health, disease, and responses to drugs and environmental factors. The information produced by the Project will be made freely available.
  4. What is there?
       • 11 populations, 90–180 individuals per population (some cases with family trios), >3M SNPs
       • Frequencies (e.g. for population P and SNP S, there are 30% of As and 70% of Cs)
       • Genotypes (data per individual)
       • Phasing data
       • Pedigree info
       • LD (linkage disequilibrium) computations
       • Copy Number Variation (CNV) info – New!
       • Reference: A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851-861. 2007.
  5. UCSC Known Genes
       • A gene set constructed by an automated process, based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from GenBank
       • Available inside the UCSC Genome Browser
       • Not only for humans (but options are limited: fewer than a handful of species)
       • Really useful for HapMap data (makes it much easier to relate SNPs to gene information than Entrez SNP does)
       • Hsu et al., Bioinformatics, 2006, 22(9):1036-1046 (but see the Genome Browser updates in NAR)
  6. We now return to our regularly scheduled program – interPopula
  7. Introduction – 1
       • A Python library to access HapMap and UCSC Known Genes data
       • A set of scripts providing integration examples, integrating interPopula with Biopython, matplotlib, Genepop and Entrez SNP
       • Interaction with the ecology of PopGen databases and Python tools encouraged
       • A set of guidelines to deal with inconsistencies across databases
       • Very easy to use, many examples
       • For Perl: Ensembl Variation API (Rios et al. BMC Bioinformatics 2010, 11:238)
  8. Introduction – 2
       • Python (2.6) based; very high test coverage
       • Uses SQLite (Python built-in, no extra dependencies)
       • Creates a local SQL database from FTP data files
       • Can be disk and network intensive
       • Intelligent download: on demand, and never fetches the same data twice
       • Database not normalized (for performance and space reasons)
       • Family support (triage of offspring)
       • Data export (Genepop)
       • X and Y aware
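The "on demand, never twice" download behaviour described on this slide can be sketched with Python's built-in sqlite3 module. This is only an illustration of the idea, not interPopula's actual implementation: the table names, the `require_chr_pop` function and the `fetch` callback (which stands in for the FTP download and parsing step) are all hypothetical, and the sketch is written for Python 3 rather than the 2.6 mentioned on the slide.

```python
import sqlite3


def require_chr_pop(conn, chromosome, pop, fetch):
    """Load SNP rows for (chromosome, pop) once; later calls do nothing.

    `fetch` is a caller-supplied function returning (rs_id, position)
    pairs. It is a stand-in for downloading and parsing a data file.
    Returns True if data was fetched, False if it was already stored.
    """
    # Hypothetical schema: one bookkeeping table, one data table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS loaded "
        "(chromosome TEXT, pop TEXT, PRIMARY KEY (chromosome, pop))"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snp "
        "(chromosome TEXT, pop TEXT, rs_id TEXT, position INTEGER)"
    )
    already = conn.execute(
        "SELECT 1 FROM loaded WHERE chromosome = ? AND pop = ?",
        (chromosome, pop),
    ).fetchone()
    if already:
        return False  # data is already local, so skip the download
    rows = [
        (chromosome, pop, rs_id, position)
        for rs_id, position in fetch(chromosome, pop)
    ]
    with conn:  # commit the data and the bookkeeping row together
        conn.executemany("INSERT INTO snp VALUES (?, ?, ?, ?)", rows)
        conn.execute("INSERT INTO loaded VALUES (?, ?)", (chromosome, pop))
    return True
```

Recording the (chromosome, population) pair in the same transaction as the data means an interrupted load is not mistaken for a completed one.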
  9. HapMap example
       To get a feel for the interface...

           freqDB = Frequency()
           freqDB.requireChrPop(chr, pop)
           RSs = freqDB.getRSsForInterval(chr, startPos, endPos)
           for rs in RSs:
               # We get frequency information
               freqSNP = freqDB.getPopSNPs(pop, rs)
               nuc1, nuc2 = freqSNP[5], freqSNP[6]
               a1a1, a2a2, a1a2 = freqSNP[7], freqSNP[8], freqSNP[9]
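The three genotype values read at the end of the HapMap example can be converted into allele frequencies of the kind mentioned earlier in the deck (30% As, 70% Cs). The sketch below assumes those values are integer counts of individuals per genotype, which the slide does not state explicitly; `allele_frequencies` is a hypothetical helper and not part of interPopula.

```python
def allele_frequencies(a1a1, a2a2, a1a2):
    """Return (freq_allele1, freq_allele2) from diploid genotype counts.

    Assumption: a1a1 and a2a2 count homozygous individuals and a1a2
    counts heterozygous individuals.
    """
    total_alleles = 2 * (a1a1 + a2a2 + a1a2)
    if total_alleles == 0:
        raise ValueError("no genotyped individuals")
    # Each homozygote carries two copies, each heterozygote one copy.
    allele1 = 2 * a1a1 + a1a2
    allele2 = 2 * a2a2 + a1a2
    return allele1 / total_alleles, allele2 / total_alleles
```

For instance, 10 A/A, 50 C/C and 40 A/C individuals give 60 A alleles and 140 C alleles out of 200.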
  10. UCSC Known Genes support
       • Everything is supported (which is not that much: just a long text file plus a link table)
       • Get different IDs (accession ID, protein ID, other links)
       • Find what is near a certain genomic position (chromosome and position within the chromosome)
       • Get the exons for a certain gene
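A "what is near this position" lookup can be sketched as an interval-overlap test. The tuple layout, the `genes_near` name and the default window size are assumptions made for illustration; they do not reflect interPopula's API or the Known Genes file format.

```python
def genes_near(genes, chromosome, position, window=10000):
    """Return names of genes overlapping position +/- window.

    `genes` is an iterable of (name, chromosome, start, end) tuples,
    a hypothetical in-memory layout used only for this sketch.
    """
    low, high = position - window, position + window
    return [
        name
        for name, chrom, start, end in genes
        # Two intervals overlap when each starts before the other ends.
        if chrom == chromosome and start <= high and end >= low
    ]
```

On a real data set the same condition would normally be expressed as an SQL WHERE clause against the local database rather than a scan over a Python list.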
  11. Integration
       • Many examples provided on interoperability (with matplotlib, Entrez SNP, Genepop and Biopython)
       • Integrating heterogeneous databases
       • Databases do use different reference assemblies
       • Example: the exon positions given by the latest version of UCSC Table Browser are not compatible with HapMap (v37 vs v36)
       • A silent bug: applications rarely crash, and results seem correct
       • This issue is discussed in the context of HapMap/Table Browser/Entrez SNP and might be useful in other cases
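Because mixing coordinates from different reference assemblies fails silently, one defensive option is to make the assembly version explicit and refuse to combine sources that disagree. The sketch below is an assumption about how such a guard could look; `check_same_assembly` is a hypothetical helper, and the caller is expected to know which assembly each source uses.

```python
def check_same_assembly(*sources):
    """Raise ValueError if data sources use different reference assemblies.

    Each source is a (name, assembly) pair, for example
    ("HapMap", "36"). Returns the shared assembly, or None when no
    sources are given.
    """
    assemblies = {assembly for _, assembly in sources}
    if len(assemblies) > 1:
        detail = ", ".join("%s=%s" % pair for pair in sources)
        raise ValueError("reference assembly mismatch: " + detail)
    return next(iter(assemblies), None)
```

Calling `check_same_assembly(("HapMap", "36"), ("UCSC Table Browser", "37"))` raises immediately, which turns the silent mismatch described on the slide into a visible error before any positions are compared.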
  12. Examples – Known Genes
  13. Examples – HapMap/Integration
  14. Future work
       • Focus on HapMap and maybe the 1000 Genomes project
       • The whole UCSC Table Browser will be spun off later into a different project
       • Copy Number Variation support (available in HapMap since June)
       • Phasing support due very soon (like next week)
       • Provide examples with genome-wide association studies